{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1a558ee1",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/index_structs/struct_indices/SQLIndexDemo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e45f9b60-cd6b-4c15-958f-1feca5438128",
   "metadata": {},
   "source": [
    "# Text-to-SQL Guide (Query Engine + Retriever)\n",
    "\n",
    "This is a basic guide to LlamaIndex's Text-to-SQL capabilities. \n",
    "1. We first show how to perform text-to-SQL over a toy dataset: this will do \"retrieval\" (sql query over db) and \"synthesis\".\n",
    "2. We then show how to buid a TableIndex over the schema to dynamically retrieve relevant tables during query-time.\n",
    "3. Next, we show how to use query-time rows and columns retrievers to enhance Text-to-SQL context.\n",
    "4. We finally show you how to define a text-to-SQL retriever on its own.\n",
    "\n",
    "**NOTE:** Any Text-to-SQL application should be aware that executing \n",
    "arbitrary SQL queries can be a security risk. It is recommended to\n",
    "take precautions as needed, such as using restricted roles, read-only\n",
    "databases, sandboxing, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f3f3baa",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2480c7af",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-core llama-index-llms-openai llama-index-embeddings-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f69be8d-99ae-4d9f-91e4-b90fc62bcf2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e28ca96e-f98f-4a72-9fe3-a372dbd08a8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-..\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "119eb42b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import logging\n",
    "# import sys\n",
    "\n",
    "# logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
    "# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "107396a9-4aa7-49b3-9f0f-a755726c19ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Markdown, display"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "461438c8-302d-45c5-8e69-16ad604686d1",
   "metadata": {},
   "source": [
    "### Create Database Schema\n",
    "\n",
    "We use `sqlalchemy`, a popular SQL database toolkit, to create an empty `city_stats` Table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a370b266-66f5-4624-bbf9-2ad57f0511f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import (\n",
    "    create_engine,\n",
    "    MetaData,\n",
    "    Table,\n",
    "    Column,\n",
    "    String,\n",
    "    Integer,\n",
    "    select,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea24f794-f10b-42e6-922d-9258b7167405",
   "metadata": {},
   "outputs": [],
   "source": [
    "engine = create_engine(\"sqlite:///:memory:\")\n",
    "metadata_obj = MetaData()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4154b29-7e23-4c26-a507-370a66186ae7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create city SQL table\n",
    "table_name = \"city_stats\"\n",
    "city_stats_table = Table(\n",
    "    table_name,\n",
    "    metadata_obj,\n",
    "    Column(\"city_name\", String(16), primary_key=True),\n",
    "    Column(\"population\", Integer),\n",
    "    Column(\"country\", String(16), nullable=False),\n",
    ")\n",
    "metadata_obj.create_all(engine)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1c09089a-6bcd-48db-8120-a84c8da3f82e",
   "metadata": {},
   "source": [
    "### Define SQL Database\n",
    "\n",
    "We first define our `SQLDatabase` abstraction (a light wrapper around SQLAlchemy). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "768d1581-b482-4c73-9963-5ffd68a2aafb",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SQLDatabase\n",
    "from llama_index.llms.openai import OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bffabba0-8e54-4f24-ad14-2c8979c582a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = OpenAI(temperature=0.1, model=\"gpt-4.1-mini\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9432787b-a8f0-4fc3-8323-e2cd9497df73",
   "metadata": {},
   "outputs": [],
   "source": [
    "sql_database = SQLDatabase(engine, include_tables=[\"city_stats\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bad7ffbe",
   "metadata": {},
   "source": [
    "We add some testing data to our SQL database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95043e10-6cdf-4f66-96bd-ce307ea7df3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import insert\n",
    "\n",
    "sql_database = SQLDatabase(engine, include_tables=[\"city_stats\"])\n",
    "\n",
    "rows = [\n",
    "    {\"city_name\": \"Toronto\", \"population\": 2930000, \"country\": \"Canada\"},\n",
    "    {\"city_name\": \"Tokyo\", \"population\": 13960000, \"country\": \"Japan\"},\n",
    "    {\n",
    "        \"city_name\": \"Chicago\",\n",
    "        \"population\": 2679000,\n",
    "        \"country\": \"United States\",\n",
    "    },\n",
    "    {\n",
    "        \"city_name\": \"New York\",\n",
    "        \"population\": 8258000,\n",
    "        \"country\": \"United States\",\n",
    "    },\n",
    "    {\"city_name\": \"Seoul\", \"population\": 9776000, \"country\": \"South Korea\"},\n",
    "    {\"city_name\": \"Busan\", \"population\": 3334000, \"country\": \"South Korea\"},\n",
    "]\n",
    "for row in rows:\n",
    "    stmt = insert(city_stats_table).values(**row)\n",
    "    with engine.begin() as connection:\n",
    "        cursor = connection.execute(stmt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b315b8ff-7dd7-4e7d-ac47-8c5a0c3e7ae9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('Toronto', 2930000, 'Canada'), ('Tokyo', 13960000, 'Japan'), ('Chicago', 2679000, 'United States'), ('New York', 8258000, 'United States'), ('Seoul', 9776000, 'South Korea'), ('Busan', 3334000, 'South Korea')]\n"
     ]
    }
   ],
   "source": [
    "# view current table\n",
    "stmt = select(\n",
    "    city_stats_table.c.city_name,\n",
    "    city_stats_table.c.population,\n",
    "    city_stats_table.c.country,\n",
    ").select_from(city_stats_table)\n",
    "\n",
    "with engine.connect() as connection:\n",
    "    results = connection.execute(stmt).fetchall()\n",
    "    print(results)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "051a171f-8c97-40ed-ae17-4e3fa3785487",
   "metadata": {},
   "source": [
    "### Query Index"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f6a2303f-3bae-4fa2-8750-03f9af747848",
   "metadata": {},
   "source": [
    "We first show how we can execute a raw SQL query, which directly executes over the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eddd3608-31ff-4591-a02a-90987e312669",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('Busan',)\n",
      "('Chicago',)\n",
      "('New York',)\n",
      "('Seoul',)\n",
      "('Tokyo',)\n",
      "('Toronto',)\n"
     ]
    }
   ],
   "source": [
    "from sqlalchemy import text\n",
    "\n",
    "with engine.connect() as con:\n",
    "    rows = con.execute(text(\"SELECT city_name from city_stats\"))\n",
    "    for row in rows:\n",
    "        print(row)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4e72b931",
   "metadata": {},
   "source": [
    "## Part 1: Text-to-SQL Query Engine\n",
    "Once we have constructed our SQL database, we can use the NLSQLTableQueryEngine to\n",
    "construct natural language queries that are synthesized into SQL queries.\n",
    "\n",
    "Note that we need to specify the tables we want to use with this query engine.\n",
    "If we don't the query engine will pull all the schema context, which could\n",
    "overflow the context window of the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d992fb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.query_engine import NLSQLTableQueryEngine\n",
    "\n",
    "query_engine = NLSQLTableQueryEngine(\n",
    "    sql_database=sql_database, tables=[\"city_stats\"], llm=llm\n",
    ")\n",
    "query_str = \"Which city has the highest population?\"\n",
    "response = query_engine.query(query_str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c0dfe9c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<b>Tokyo has the highest population among all cities, with a population of 13,960,000.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "298b4ca2",
   "metadata": {},
   "source": [
    "This query engine should be used in any case where you can specify the tables you want\n",
    "to query over beforehand, or the total size of all the table schema plus the rest of\n",
    "the prompt fits your context window."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dee4d251",
   "metadata": {},
   "source": [
    "## Part 2: Query-Time Retrieval of Tables for Text-to-SQL\n",
    "If we don't know ahead of time which table we would like to use, and the total size of\n",
    "the table schema overflows your context window size, we should store the table schema \n",
    "in an index so that during query time we can retrieve the right schema.\n",
    "\n",
    "The way we can do this is using the SQLTableNodeMapping object, which takes in a \n",
    "SQLDatabase and produces a Node object for each SQLTableSchema object passed \n",
    "into the ObjectIndex constructor.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d71045c0-7a96-4e86-b38c-c378b7759aa4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.indices.struct_store.sql_query import (\n",
    "    SQLTableRetrieverQueryEngine,\n",
    ")\n",
    "from llama_index.core.objects import (\n",
    "    SQLTableNodeMapping,\n",
    "    ObjectIndex,\n",
    "    SQLTableSchema,\n",
    ")\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.core.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "# set Logging to DEBUG for more detailed outputs\n",
    "table_node_mapping = SQLTableNodeMapping(sql_database)\n",
    "table_schema_objs = [\n",
    "    (SQLTableSchema(table_name=\"city_stats\"))\n",
    "]  # add a SQLTableSchema for each table\n",
    "\n",
    "obj_index = ObjectIndex.from_objects(\n",
    "    table_schema_objs,\n",
    "    table_node_mapping,\n",
    "    VectorStoreIndex,\n",
    "    embed_model=OpenAIEmbedding(model=\"text-embedding-3-small\"),\n",
    ")\n",
    "query_engine = SQLTableRetrieverQueryEngine(\n",
    "    sql_database, obj_index.as_retriever(similarity_top_k=1)\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b6156caf",
   "metadata": {},
   "source": [
    "Now we can take our SQLTableRetrieverQueryEngine and query it for our response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "802da9ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<b>Tokyo has the highest population among all cities, with a population of 13,960,000.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\"Which city has the highest population?\")\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54a99cb0-578a-40ec-a3eb-1666ac18fbed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Tokyo', 13960000)]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# you can also fetch the raw result from SQLAlchemy!\n",
    "response.metadata[\"result\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0d19b9cd",
   "metadata": {},
   "source": [
    "You can also add additional context information for each table schema you define."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44a87651",
   "metadata": {},
   "outputs": [],
   "source": [
    "# manually set context text\n",
    "city_stats_text = (\n",
    "    \"This table gives information regarding the population and country of a\"\n",
    "    \" given city.\\nThe user will query with codewords, where 'foo' corresponds\"\n",
    "    \" to population and 'bar'corresponds to city.\"\n",
    ")\n",
    "\n",
    "table_node_mapping = SQLTableNodeMapping(sql_database)\n",
    "table_schema_objs = [\n",
    "    (SQLTableSchema(table_name=\"city_stats\", context_str=city_stats_text))\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b51919cc5ffde470",
   "metadata": {},
   "source": [
    "## Part 3: Query-Time Retrieval of Rows and Columns for Text-to-SQL\n",
    "\n",
    "One challenge arises when asking a question like: \"How many cities are in the US?\" In such cases, the generated query might only look for cities where the country is listed as \"US,\" potentially missing entries labeled \"United States.\" To address this issue, you can apply query-time row retrieval, query-time column retrieval, or a combination of both."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4add45639dabd5bb",
   "metadata": {},
   "source": [
    "### Query-Time Rows Retrieval\n",
    "\n",
    "In query-time rows retrieval, we embed the rows of each table, resulting in one index per table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55875c8c95e53258",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[NodeWithScore(node=TextNode(id_='8ae10176-afd8-40ee-a97b-b24f66235489', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\\n', text=\"('Chicago', 2679000, 'United States')\", mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\\n', text_template='{metadata_str}\\n\\n{content}'), score=0.7843469586763699)]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from llama_index.core.schema import TextNode\n",
    "\n",
    "with engine.connect() as connection:\n",
    "    results = connection.execute(stmt).fetchall()\n",
    "\n",
    "city_nodes = [TextNode(text=str(t)) for t in results]\n",
    "\n",
    "city_rows_index = VectorStoreIndex(\n",
    "    city_nodes, embed_model=OpenAIEmbedding(model=\"text-embedding-3-small\")\n",
    ")\n",
    "city_rows_retriever = city_rows_index.as_retriever(similarity_top_k=1)\n",
    "\n",
    "city_rows_retriever.retrieve(\"US\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2570ba19d40bff55",
   "metadata": {},
   "source": [
    "Then, the rows retriever of each table can be provided to the SQLTableRetrieverQueryEngine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f81ad4d63e7b5670",
   "metadata": {},
   "outputs": [],
   "source": [
    "rows_retrievers = {\n",
    "    \"city_stats\": city_rows_retriever,\n",
    "}\n",
    "query_engine = SQLTableRetrieverQueryEngine(\n",
    "    sql_database,\n",
    "    obj_index.as_retriever(similarity_top_k=1),\n",
    "    rows_retrievers=rows_retrievers,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a153230128b5bfe",
   "metadata": {},
   "source": [
    "During querying, the row retrievers are used to identify the rows most semantically similar to the input query. These retrieved rows are then incorporated as context to enhance the performance of the text-to-SQL generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68d1198fbbf69f8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"How many cities are in the US?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f4c7ec98e2e88d0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<b>There are 2 cities in the United States according to the data in the city_stats table.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1452dbdf88cf272a",
   "metadata": {},
   "source": [
    "### Query-Time Columns Retrieval\n",
    "While query-time row retrieval enhances text-to-SQL generation, it embeds each row individually, even when many rows share repeated values—such as those in categorical. This can lead to token inefficiency and unnecessary overhead. Moreover, in tables with a large number of columns, the retriever may surface only a subset of relevant values, potentially omitting others that are important for accurate query generation.\n",
    "\n",
    "To address this issue, query-time column retrieval can be used. This approach indexes each distinct value within selected columns, creating a separate index for each column in the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ac6d196eda1ed2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "city_cols_retrievers = {}\n",
    "\n",
    "for column_name in [\"city_name\", \"country\"]:\n",
    "    stmt = select(city_stats_table.c[column_name]).distinct()\n",
    "    with engine.connect() as connection:\n",
    "        values = connection.execute(stmt).fetchall()\n",
    "    nodes = [TextNode(text=t[0]) for t in values]\n",
    "\n",
    "    column_index = VectorStoreIndex(\n",
    "        nodes, embed_model=OpenAIEmbedding(model=\"text-embedding-3-small\")\n",
    "    )\n",
    "    column_retriever = column_index.as_retriever(similarity_top_k=1)\n",
    "\n",
    "    city_cols_retrievers[column_name] = column_retriever"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58c44af47fedebef",
   "metadata": {},
   "source": [
    "Then, columns retrievers of each table can be provided to the SQLTableRetrieverQueryEngine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49adc00ae6dc345b",
   "metadata": {},
   "outputs": [],
   "source": [
    "cols_retrievers = {\n",
    "    \"city_stats\": city_cols_retrievers,\n",
    "}\n",
    "query_engine = SQLTableRetrieverQueryEngine(\n",
    "    sql_database,\n",
    "    obj_index.as_retriever(similarity_top_k=1),\n",
    "    rows_retrievers=rows_retrievers,\n",
    "    cols_retrievers=cols_retrievers,\n",
    "    llm=llm,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65f391ebbdbc594b",
   "metadata": {},
   "source": [
    "During querying, the columns retrievers are used to identify the values of columns that are the most semantically similar to the input query. These retrieved values are then incorporated as context to enhance the performance of the text-to-SQL generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c5323ac8435a723",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"How many cities are in the US?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3c4c88d3b77da78",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<b>There are 2 cities in the United States.</b>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ef57746-7cbf-48cd-8781-9c048ce3d3c7",
   "metadata": {},
   "source": [
    "## Part 4: Text-to-SQL Retriever\n",
    "\n",
    "So far our text-to-SQL capability is packaged in a query engine and consists of both retrieval and synthesis.\n",
    "\n",
    "You can use the SQL retriever on its own. We show you some different parameters you can try, and also show how to plug it into our `RetrieverQueryEngine` to get roughly the same results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99128819-6acf-4717-b703-9d1ca41190a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.retrievers import NLSQLRetriever\n",
    "\n",
    "# default retrieval (return_raw=True)\n",
    "nl_sql_retriever = NLSQLRetriever(\n",
    "    sql_database, tables=[\"city_stats\"], llm=llm, return_raw=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19bc6858-7e3b-4cf1-806f-0d6d63a55d61",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = nl_sql_retriever.retrieve(\n",
    "    \"Return the top 5 cities (along with their populations) with the highest population.\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3576523f-0b5c-4c40-8b2a-d734bdfbde58",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** f640a54f-7413-4dc0-9135-cd63c7ca8f45<br>**Similarity:** None<br>**Text:** [('Tokyo', 13960000), ('Seoul', 9776000), ('New York', 8258000), ('Busan', 3334000), ('Toronto', ...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.core.response.notebook_utils import display_source_node\n",
    "\n",
    "for n in results:\n",
    "    display_source_node(n)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acfdc307-65e1-4a0e-8098-f30c0db60c93",
   "metadata": {},
   "outputs": [],
   "source": [
    "# default retrieval (return_raw=False)\n",
    "nl_sql_retriever = NLSQLRetriever(\n",
    "    sql_database, tables=[\"city_stats\"], return_raw=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36ccfa42-51e0-49e7-afac-6a0d011af4e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = nl_sql_retriever.retrieve(\n",
    "    \"Return the top 5 cities (along with their populations) with the highest population.\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cd14d5e-7a5c-4898-820a-469b80cee5c9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 05c61a90-598e-4c29-a6b4-b27f2579819e<br>**Similarity:** None<br>**Text:** <br>**Metadata:** {'city_name': 'Tokyo', 'population': 13960000}<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** c7f5fc4c-9754-4946-92c6-54a0d2b40fd9<br>**Similarity:** None<br>**Text:** <br>**Metadata:** {'city_name': 'Seoul', 'population': 9776000}<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 3a00e201-f3b5-430e-af0e-aa4c34a71131<br>**Similarity:** None<br>**Text:** <br>**Metadata:** {'city_name': 'New York', 'population': 8258000}<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** ee911f7f-8aae-4bad-a52d-c0bdfab63942<br>**Similarity:** None<br>**Text:** <br>**Metadata:** {'city_name': 'Busan', 'population': 3334000}<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** dca6b482-52e4-41e0-992f-a58109e6f3f6<br>**Similarity:** None<br>**Text:** <br>**Metadata:** {'city_name': 'Toronto', 'population': 2930000}<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# NOTE: all the content is in the metadata\n",
    "for n in results:\n",
    "    display_source_node(n, show_source_metadata=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "392889f6-babe-4bce-b802-4379a9b3ca49",
   "metadata": {},
   "source": [
    "### Plug into our `RetrieverQueryEngine`\n",
    "\n",
    "We compose our SQL Retriever with our standard `RetrieverQueryEngine` to synthesize a response. The result is roughly similar to our packaged `Text-to-SQL` query engines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f85e7a24-bbd0-4439-9da4-26a7a9e43b8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
    "\n",
    "query_engine = RetrieverQueryEngine.from_args(nl_sql_retriever, llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50f2dec0-8700-49d3-81b7-69c21518ac87",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\n",
    "    \"Return the top 5 cities (along with their populations) with the highest population.\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56eb7e5a-a23b-47ab-a154-98ce4c0d1c1c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The top 5 cities with the highest populations are:\n",
      "\n",
      "1. Tokyo - 13,960,000\n",
      "2. Seoul - 9,776,000\n",
      "3. New York - 8,258,000\n",
      "4. Busan - 3,334,000\n",
      "5. Toronto - 2,930,000\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
